ds_PWDB

We will use the gensim workflow to build a topic model based on the Latent Dirichlet Allocation (LDA)
algorithm, using noun phrases from the ds_PWDB dataset, and explore strategies to visualize
the results effectively with the plotly package.
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
import spacy
nlp = spacy.load("en_core_web_sm")
import pandas as pd
from sem_covid.services.data_registry import Dataset
from sem_covid.entrypoints.notebooks.topic_modeling.topic_modeling_wrangling.topic_visualizer import TopicInformation, \
generate_wordcloud, plotly_bar_chart_graphic
from sem_covid.entrypoints.notebooks.topic_modeling.topic_modeling_wrangling.lda_modeling import WordsModeling
from sem_covid.services.sc_wrangling.data_cleaning import clean_remove_stopwords, clean_text_from_specific_characters
pwdb = Dataset.PWDB.fetch()
pwdb.head()
| _id | identifier | title | title_national_language | country | start_date | end_date | date_type | type_of_measure | status_of_regulation | category | ... | funding | involvement_of_social_partners_description | social_partner_involvement_form | social_partner_role | is_sector_specific | private_or_public_sector | is_occupation_specific | sectors | occupations | sources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| adc5c75937bc7f7198f534d08b85bd50c9521bfd3f319a090932b5d0bae54de0 | 1297 | Agreement on a teleworking regime | Convention du 20 octobre 2020 relative au régi... | Luxembourg | 10/20/2020 | None | Open ended | Bipartite collective agreements | Entirely new measure | Protection of workers, adaptation of workplace | ... | [Companies] | The agreement is a social partner initiative w... | None | None | No | Only private sector | No | [] | [] | [{'title': 'Accord signé entre partenaires soc... |
| 2372d71eb9ad6e6a70982e02bbe802db004ed49d91b2264c0a2e8e41571002cc | 864 | Special protection for COVID-19 risk groups at... | Besonderer Schutz von Risikogruppen | Austria | 05/06/2020 | 05/31/2021 | Temporary | Legislations or other statutory regulations | Entirely new measure | Protection of workers, adaptation of workplace | ... | [National funds] | The expert group which elaborated the definiti... | None | None | No | Not specified | No | [] | [] | [{'title': 'FAQs Risk groups', 'url': 'https:/... |
| 8735e268191e9e5cbd3d2a44ca53d297e31746b5f1e24b941db6225a25848353 | 1228 | Funds for innovative renewable projects in And... | Ayudas para proyectos de renovables en Andaluc... | Spain | 08/25/2020 | None | Open ended | Legislations or other statutory regulations | Entirely new measure | Promoting the economic, labour market and soci... | ... | [Employer, European Funds, Local funds, Nation... | No involvement reported | None | None | Yes | Not specified | No | [Electricity, gas, steam and air conditioning ... | [] | [{'title': 'LÍNEAS DE AYUDA PARA PROYECTOS REN... |
| 18bcd22116c46919e03a3345f793c3859855227ac942e69dd13cbfcd588e1044 | 183 | Waiver of advance payments for social and heal... | Odklad záloh sociálního a zdravotního pojištěn... | Czechia | 03/01/2020 | 08/31/2020 | Temporary | Legislations or other statutory regulations | Entirely new measure | Supporting businesses to stay afloat | ... | [No special funding required] | Social partners, who are members of the tripar... | None | None | No | Not specified | No | [] | [] | [{'title': 'CSSZ: Self-employed persons - coro... |
| b94d8aa95fbdeb1bb832b01fbe5d6e9bf9fc36fceb14f7ba370a963f472fe35b | 1550 | Financial Shield 2.0: Small and medium-sized e... | Tarcza Finansowa 2.0: małe i średnie przedsięb... | Poland | 01/01/2021 | None | Temporary | Legislations or other statutory regulations | New aspects included into existing measure | Supporting businesses to stay afloat | ... | [National funds] | None. | None | None | Yes | Only private sector | No | [Manufacture of paper and paper products, Prin... | [] | [{'title': 'The Act of 31 March 2020 amending ... |
5 rows × 27 columns
After importing and inspecting the dataset, we identify the columns that contain textual data and concatenate them into a single text field to work with.
pwdb_descriptive_data = pwdb['title'].map(str) + ' ' + \
pwdb['background_info_description'].map(str) + ' ' + \
pwdb['content_of_measure_description'].map(str) + ' ' + \
pwdb['use_of_measure_description'] + ' ' + \
pwdb['involvement_of_social_partners_description']
pwdb_descriptive_data
_id
adc5c75937bc7f7198f534d08b85bd50c9521bfd3f319a090932b5d0bae54de0 Agreement on a teleworking regime During the C...
2372d71eb9ad6e6a70982e02bbe802db004ed49d91b2264c0a2e8e41571002cc Special protection for COVID-19 risk groups at...
8735e268191e9e5cbd3d2a44ca53d297e31746b5f1e24b941db6225a25848353 Funds for innovative renewable projects in And...
18bcd22116c46919e03a3345f793c3859855227ac942e69dd13cbfcd588e1044 Waiver of advance payments for social and heal...
b94d8aa95fbdeb1bb832b01fbe5d6e9bf9fc36fceb14f7ba370a963f472fe35b Financial Shield 2.0: Small and medium-sized e...
...
cb014a456b14c3621dd318a12e611f70c2a9636be9fe181072bd4bf5917a40fa Automatic extension of unemployment benefits D...
d233b17dc2b98f14269c2b22be78d93ec5ccf2a0013b86f09175c69353c5800b Extra subsidies to institutions in the cultura...
77d7e3c52aaf78bdfb1a1667641db1293bbff862440c547fc3f6e8ab4fbd0d4e Expanding business loan guarantee scheme The ...
be8e21b382b63dd838087d7864e4d49a4807d9cc5d71aaa82577383ece92c581 Collective working time reduction for companie...
c434820f8f8c0b1447ba2f6f62aed177b9597532d7fcd156c6f556532dd7d11e One-time premium for construction workers Comp...
Length: 1288, dtype: object
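The concatenation above casts only the first three columns with `.map(str)`. In pandas, adding a missing value to a string typically yields a missing result, so a gap in either of the last two columns could blank out the whole row. Below is a minimal sketch of a more defensive variant, assuming missing values should simply become empty text. The helper name `concatenate_text_columns` and the toy frame are hypothetical and not part of sem_covid.

```python
import pandas as pd


def concatenate_text_columns(frame, columns):
    """Join the given columns row by row, treating missing values as empty text."""
    # Replace missing cells with "" and make sure every cell is a string.
    as_text = frame[columns].fillna("").astype(str)
    # Join the cells of each row with a single space.
    joined = as_text.agg(" ".join, axis=1)
    # Normalise runs of whitespace (including gaps left by empty cells).
    return joined.str.split().str.join(" ")


# Toy data (illustrative only): the second row has a missing description.
toy = pd.DataFrame({
    "title": ["Teleworking agreement", "Loan guarantee"],
    "use_of_measure_description": ["No data available.", None],
})
combined = concatenate_text_columns(toy, ["title", "use_of_measure_description"])
print(combined.tolist())
```

Whether empty text or the literal string `None` is the better placeholder depends on the later cleaning steps; this sketch only shows one option.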
After extracting the data, one of the most important steps is cleaning it. We remove the characters listed below to make the text cleaner.
unused_characters = ["\\r", ">", "\n", "\\", "<", "''", "%", "...", "\'", '"', "(", "\n"]
clean_text = clean_text_from_specific_characters(pwdb_descriptive_data, unused_characters)
clean_text
'[Agreement teleworking regime During COVID-19 crisis, teleworking identified vital pillar companies working prevent social hardship. Discussions representative social partners OGBL LCGB employer association UEL, Ministry Work Employment, led instance joint assessment teleworking level Economic Social Council CES. From there, discussions continued social partners inter-professional agreement signed social partners October 2020. Applied period years covering sectors Luxembourg with exception transport), agreement provides definition teleworking: * Teleworking identified form organisational work, conducted digital means usually company, transferred location employee lives. * The work considered teleworking applied occasional exceptional circumstances, remains 10 threshold annual working time. * Teleworking based written agreement employer employee, containing compulsory elements, example location telework takes place number hours, employee cant dismissed he/she accept teleworking scheme employer cant accountable employer refuses teleworking scheme. * The employer provide employee technical equipment efficiently work teleworking scheme. * The company staff delegation regularly informed teleworking schemes. All employees private sector with exception transport sector) covered Code Work concerned. The agreement social partner initiative Economic Social Committee. The representative trade unions OGBL LCGB, employer association UEL signed agreement. Special protection COVID-19 risk groups work In face Corona crisis, employers taken numerous measures protect employees infections. Depending activity, includes, example, possibilities home office, redesigning workplaces maintain safe distance, installed barriers plexiglass walls use personal protective equipment.The COVID-19 Law paved way defining specific risk groups, protective measures apply. 
Employed workers high risk developing illness infected Corona virus entitled work home office change working conditions meaning working conditions designed way infection COVID-19 impossible, taking account way work). Ultimately, possible, paid temporary leave absence obtained payment employer, reimbursed social security system wage costs including ancillary wage costs).An expert group consisting representatives Federal Ministry Social Affairs, Health, Nursing Consumer Protection, representative Federal Ministry Labor, Family Youth, representatives Medical Association representatives social security system held meetings based previous experience COVID-19 sufferers Austrias hospitals international scientific results identified groups people higher risk developing illness. The COVID-19 risk group regulation lists medical reasons indications) belonging COVID-19 risk group. Based indications, doctor issue COVID-19 risk certificate.The main medical indications are:1. advanced chronic lung diseases require permanent, daily, dual medication2. chronic heart diseases end organ damage require permanent therapy, ischemic heart diseases heart failure3. active cancer oncological pharmacotherapy chemotherapy, biologics) / radiation therapy past months, metastatic cancer ongoing therapy4. diseases need treated immunosuppression5. advanced chronic kidney disease6. chronic liver diseases organ remodeling decompensated liver cirrhosis Childs stage B7. pronounced obesity obesity grade III BMI = 408. diabetes mellitus9. arterial hypertension existing end organ damage, especially chronic heart kidney failure, uncontrollable blood pressure adjustment.These main medical indications divided described detail regulation.In addition, other, similarly diseases functional physical limitations provide special protection COVID-19 risk certificate.Most members risk group identified medication health insurance. They received letters social security institution legal provisions came force May. 
Furthermore, individual risk analysis obtained physician e.g. patients cancer therapy prescribed medication receive treatment hospital dialysis patients). The doctor carries risk assessment patient based recommendations individual risk analysis severe course disease. If underlying illness meets recommendations, COVID-19 risk certificate issued.Employers affected jointly consider special protective measures possible workplace. If possible, home office used. If possible either, workers belonging risk group entitled paid leave work. According information Minister 21 April 2020, 90,000 workers expected belong risk group. No exact data available information use paid leave available now. The expert group elaborated definition risk groups consisted representatives relevant ministries, doctors representatives umbrella association social security institutions. The latters board consists social partner representatives assumed included expert group meetings. Funds innovative renewable projects Andalusia Extremadura The Council Ministers agreed authorise Institute Energy Diversification Saving IDAE) aid investment thermal energy production facilities renewable energy sources Andalusia calls investment aid Electric power generation facilities renewable energy sources autonomous communities Andalusia Extremadura. In total, €136.3 million allocated boost renewable energies regions - €124.3 million Andalusia €12 million Extremadura. In calls, aid range average 20 30 cost projects, assigned, competitive competition, depending level maturity, technological innovation management, assessing link Just Transition Demographic Challenge job creation.These calls renewable energy promotion package initially endowed €316 million, calls rest Autonomous Communities finalised, collaboration respective autonomous governments. No data available. 
No involvement reported Expanding business loan guarantee scheme The case initiated Dutch government, specifically Dutch Ministry Economic Affairs Climate. This measure package emergency solutions deal COVID-19 taken combat outbreak effects.This measure permanent feature temporally expanded consequence emergency condition economy implementation social distancing measures taken government. Enterprises large SMEs) need offer guarantees applying bank loans bank guarantees. The Netherlands measure place issues, called Guarantee Enterprise Financing Garantie Ondernemersfinanciering, GO), government acts guarantor 50 enterprise’s loan. To help enterprise continuity, national government raised ceiling GO €400 million € €1,5 billion March 2020. The maximum guarantee enterprise €150 million. As April 2020, credit guarantee increased 50 75. This enables banks extend credit easily quickly, enterprises lend money faster. The GO ceiling raised €10 billion. No information available. The main, national, level social partners involved development emergency response measures. The national government indicates meeting regular, weekly basis, social partners consult measures. Collective working time reduction companies difficulties The measure intended businesses officially recognised difficulty restructuring. This official recognition specific support measures place try businesses afloat, new measure them. It allows employers reduce labour costs reducing working hours performed company. A company reduce working hours collectively respond reduced production and/or turnover COVID-19 crisis. The workers concerned receive converted salary supplemented lump-sum wage compensation, covers wage loss financed temporary reduction employers social security contribution.The measure reserved companies formally recognised difficulty undergoing restructuring approval period starting 1 March 2020 31 December 2020. 
This measure allows companies difficulty restructuring reduce wage costs long suffer reduced activity. The measure similar implemented 2009-2011 crisis. No data available uptake measure. It unclear extent social partners involved design matter, public documentation available. The measure regulated royal decree collective agreement signed social partners. One-time premium construction workers Compared sectors, construction sector affected COVID-19. The sector profits public recovery programme public investments infrastructure building retrofits. In 2020 collective bargaining round, construction workers union IG BAU raised demands wage rise, reimbursement transport costs construction site and, additionally, COVID-19 premium. The negotiations employer organisation failed collective bargaining partners involve arbitrator decided favour union demands. The social partners accepted arbitrator decision 17 September 2020 On 3 September 2020, arbitrator decided favour travel-to-and-from work reimbursements, wage increases, one-time COVID-19 premium. The COVID-19 premia €500 workers €250 apprentices. No taxes social security contribution apply premia, meaning workers receive deductions. The wage increase includes recognition workers travel expenses journey work. This wage increase equivalent 0.5 increase hourly wage. The total wage increases, January 2020, vary according different regions lie 2. No data point. 1) The 2020 collective bargaining round took place, failed. 2) All social partners accepted arbitrator decision 17 September 2020. 3) Employers cover cost premia wage increases.]'
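The character-stripping step can be sketched in plain Python. This is a minimal stand-in, not the `clean_text_from_specific_characters` implementation from sem_covid, which, judging by the output above, appears to drop stop words as well. The function name `remove_substrings` and the sample sentence are hypothetical.

```python
def remove_substrings(text, unwanted):
    """Delete every listed fragment from the text, then tidy the whitespace."""
    for fragment in unwanted:
        # str.replace removes all occurrences of the fragment.
        text = text.replace(fragment, "")
    # Collapse any runs of whitespace into single spaces.
    return " ".join(text.split())


# Illustrative input, not taken from the dataset.
sample = "It's 50% (approx.)\n"
cleaned = remove_substrings(sample, ["'", "%", "(", ")", "\n"])
print(cleaned)  # prints: Its 50 approx.
```

Note that the list in the notebook contains `(` but not `)`, which would explain the stray closing parentheses that remain in the output above.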
doc = nlp(clean_text)
noun_phrases = [word.text for word in doc.noun_chunks]
separated_noun_phrases = [[nouns] for nouns in noun_phrases]
Let's visualize the noun phrases we found as a word cloud.
generate_wordcloud(str(noun_phrases))
To build the LDA topic model we use the WordsModeling class, which holds the corpus and the dictionary
built from the detected noun phrases.
word_modeling = WordsModeling(separated_noun_phrases)
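To make the two terms concrete: in gensim-style workflows the dictionary maps each distinct token to an integer id, and the corpus stores every document as `(token_id, count)` pairs. The sketch below builds both in plain Python, on the assumption that `WordsModeling` does something equivalent internally. Its actual implementation is not shown in this notebook, and gensim's own `Dictionary` may assign ids in a different order. The helper names are hypothetical.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for document in documents:
        for token in document:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(document, token_to_id):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[token] for token in document)
    return sorted(counts.items())


# Each noun phrase is wrapped in its own list, as in separated_noun_phrases.
documents = [["social partners"], ["home office"], ["social partners"]]
dictionary = build_dictionary(documents)
corpus = [to_bag_of_words(document, dictionary) for document in documents]
print(dictionary)
print(corpus)
```

Because every noun phrase is its own single-token document here, each bag of words contains exactly one pair.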
LDA training is run with gensim's LdaMulticore, which creates 10 topics from
our noun phrases. Here are the keywords of one of the generated topics (topic 8).
lda_training = word_modeling.lda_model_training()
lda_training.show_topic(8)
[('private sector', 0.023142157),
('teleworking scheme', 0.023142157),
('installed barriers plexiglass walls', 0.023041297),
('OGBL LCGB employer association UEL', 0.019949127),
('The measure', 0.003518124),
('This measure', 0.0034852996),
('arbitrator decision', 0.0034820212),
('arbitrator', 0.0034754046),
('Employers', 0.0034729026),
('place', 0.003470808)]
visualize = TopicInformation(lda_training, separated_noun_phrases)
In LDA models, each document is composed of multiple topics. Here we find out which topic each document
predominantly belongs to. The table has 5 columns and is based on the LDA training
with the extracted noun phrases. Each noun phrase is treated as a document, and each topic is represented by a
list of keywords. The Topic Percentage Contribution column shows how strongly
the dominant topic contributes to each document.
topic_visualizer = visualize.format_topic_sentences()
topic_visualizer
| | Document Number | Dominant Topic | Topic Percentage Contribution | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 2.0 | 0.5500 | company, example location telework, discussion... | [COVID-19 crisis] |
| 1 | 1 | 5.0 | 0.5500 | home office, personal protective equipment, ac... | [identified vital pillar companies] |
| 2 | 2 | 4.0 | 0.5500 | The company staff delegation, example, compuls... | [social hardship] |
| 3 | 3 | 1.0 | 0.5500 | redesigning workplaces, employees infections, ... | [Discussions representative social partners] |
| 4 | 4 | 8.0 | 0.5500 | private sector, teleworking scheme, installed ... | [OGBL LCGB employer association UEL] |
| ... | ... | ... | ... | ... | ... |
| 287 | 287 | 7.0 | 0.1000 | teleworking scheme employer, OGBL LCGB, employ... | [place] |
| 288 | 288 | 0.0 | 0.1001 | teleworking schemes, numerous measures, Code W... | [All social partners] |
| 289 | 289 | 7.0 | 0.1001 | teleworking scheme employer, OGBL LCGB, employ... | [arbitrator decision] |
| 290 | 290 | 7.0 | 0.1001 | teleworking scheme employer, OGBL LCGB, employ... | [Employers] |
| 291 | 291 | 7.0 | 0.1001 | teleworking scheme employer, OGBL LCGB, employ... | [cost premia wage increases] |
292 rows × 5 columns
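Picking the dominant topic amounts to taking the highest-weighted entry of a document's topic distribution. Below is a minimal sketch, assuming each distribution is a list of `(topic_id, weight)` pairs, which is the shape gensim returns for a document. `dominant_topic` is a hypothetical helper, and the weights are made up to resemble the 0.55 rows in the table.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(topic_distribution, key=lambda pair: pair[1])


# Illustrative distribution over 10 topics: one strong topic, nine weak ones.
distribution = [(topic_id, 0.05) for topic_id in range(10)]
distribution[2] = (2, 0.55)

topic_id, weight = dominant_topic(distribution)
print(topic_id, weight)  # prints: 2 0.55
```

When all weights are nearly equal, as in the 0.1000 rows above, the selected topic depends on very small differences, so such assignments carry little information.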
Here we use the plotly package to create bar charts that visualize the information in the
table above.
The first chart shows the number of documents in each topic and how strongly topic 7 dominates the others.
documents_per_topics = plotly_bar_chart_graphic("Documents belongs to Topics", topic_visualizer['Dominant Topic'],
topic_visualizer['Document Number'], 'Dominant Topic', 'Document Number')
documents_per_topics
The second chart shows the same topics by contribution. Although topic 7 covers the most documents, it has the lowest contribution percentage of all topics.
percentage_per_topics = plotly_bar_chart_graphic("Topic Percentage Contribution per to Topics",
topic_visualizer['Dominant Topic'],
topic_visualizer['Topic Percentage Contribution'],
'Dominant Topic', 'Topic Percentage Contribution')
percentage_per_topics
Based on this information, let's view the most representative text for each topic.
visualize.select_most_illustrative_sentence()
| | Document Number | Topic Number | Topic Percentage Contribution | Keywords | Representative Text |
|---|---|---|---|---|---|
| 0 | 13 | 0.0 | 0.55 | teleworking schemes, numerous measures, Code W... | [exception transport] |
| 1 | 3 | 1.0 | 0.55 | redesigning workplaces, employees infections, ... | [Discussions representative social partners] |
| 2 | 0 | 2.0 | 0.55 | company, example location telework, discussion... | [COVID-19 crisis] |
| 3 | 10 | 3.0 | 0.55 | social partners, The COVID-19, employers, empl... | [social partners] |
| 4 | 2 | 4.0 | 0.55 | The company staff delegation, example, compuls... | [social hardship] |
| 5 | 1 | 5.0 | 0.55 | home office, personal protective equipment, ac... | [identified vital pillar companies] |
| 6 | 6 | 6.0 | 0.55 | scheme, exception transport sector, The employ... | [instance] |
| 7 | 30 | 7.0 | 0.55 | teleworking scheme employer, OGBL LCGB, employ... | [teleworking scheme employer] |
| 8 | 4 | 8.0 | 0.55 | private sector, teleworking scheme, installed ... | [OGBL LCGB employer association UEL] |
| 9 | 9 | 9.0 | 0.55 | All employees, safe distance, UEL signed agree... | [inter-professional agreement] |
Let’s compute the total number of documents attributed to each topic.
Here is the distribution of dominant topics across the documents.
dominant_topics, topic_percentages = visualize.topic_per_document(end=-1)
df = pd.DataFrame(dominant_topics, columns=['Document Id', 'Dominant Topic'])
dominant_topic_in_each_doc = df.groupby('Dominant Topic').size()
df_dominant_topic_in_each_doc = dominant_topic_in_each_doc.to_frame(name='count').reset_index()
df_dominant_topic_in_each_doc
| | Dominant Topic | count |
|---|---|---|
| 0 | 0 | 22 |
| 1 | 1 | 6 |
| 2 | 2 | 6 |
| 3 | 3 | 10 |
| 4 | 4 | 9 |
| 5 | 5 | 8 |
| 6 | 6 | 9 |
| 7 | 7 | 195 |
| 8 | 8 | 19 |
| 9 | 9 | 7 |
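To read these counts as proportions, divide each one by the total. The snippet below does this for the values in the table above; only the variable names are new.

```python
# Documents per dominant topic, copied from the table above.
counts = {0: 22, 1: 6, 2: 6, 3: 10, 4: 9, 5: 8, 6: 9, 7: 195, 8: 19, 9: 7}

total = sum(counts.values())
shares = {topic: count / total for topic, count in counts.items()}

# Topic with the largest share of documents.
largest_topic = max(shares, key=shares.get)
print(f"{total} documents; topic {largest_topic} holds {shares[largest_topic]:.1%}")
# prints: 291 documents; topic 7 holds 67.0%
```

The total of 291 is one less than the 292 rows of the earlier table, presumably because `topic_per_document(end=-1)` leaves out the last document.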
Total topic distribution by actual weight (the column labelled count holds the summed weights, not a number of documents)
topic_weightage_by_doc = pd.DataFrame([dict(topic) for topic in topic_percentages])
df_topic_weightage_by_doc = topic_weightage_by_doc.sum().to_frame(name='count').reset_index()
df_topic_weightage_by_doc
| | index | count |
|---|---|---|
| 0 | 0 | 27.807321 |
| 1 | 1 | 28.802138 |
| 2 | 2 | 28.801969 |
| 3 | 3 | 30.782964 |
| 4 | 4 | 30.296390 |
| 5 | 5 | 29.796438 |
| 6 | 6 | 30.295629 |
| 7 | 7 | 27.310046 |
| 8 | 8 | 27.807253 |
| 9 | 9 | 29.299854 |
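The aggregation above sums, for every topic, its weight across all documents. Below is an equivalent plain-Python sketch, assuming `topic_percentages` is a list of per-document `(topic_id, weight)` lists. The helper name is hypothetical, and the toy values are illustrative, not taken from the model.

```python
from collections import defaultdict


def total_weight_per_topic(topic_percentages):
    """Sum each topic's weight over all document distributions."""
    totals = defaultdict(float)
    for distribution in topic_percentages:
        for topic_id, weight in distribution:
            totals[topic_id] += weight
    return dict(totals)


# Two toy documents over two topics; each distribution sums to 1.
toy = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
]
totals = total_weight_per_topic(toy)
print(totals)
```

Because each distribution sums to 1, the totals add up to the number of documents. The ten values in the table above sum to about 291, which matches the document count.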
Finally, pyLDAvis is a commonly used and convenient way to visualise the information contained in a topic model.
Below is the visualization of the trained LDA model.
visualize.visualize_lda_model()